New York City Bike-share Data Analysis using R

Introduction

Citi Bike is New York City’s bike share system, and the largest in the nation. This case study uses Citi Bike trip data covering customers’ rides over a 12-month period (January 2020 - December 2020). The data has been made available by Citi Bike at https://s3.amazonaws.com/tripdata/index.html

Scenario (invented for the purpose of this case study)

The marketing team needs to design marketing strategies aimed at converting casual riders into annual members. To do that, however, the marketing analytics team needs to better understand how annual members and casual riders differ.

Objective

Citi Bike wants to analyze its user data to find the main differences in behaviour between its two types of users: the “casual” rider, who pays per ride or for a day pass, and the annual member, who pays a yearly subscription to the service.

Stakeholders

The main stakeholders here are the Director of Marketing (my manager, Ms Jackson), the rest of the marketing analytics team, and the Citi Bike executive team.

Data Sources

A total of 12 datasets have been made available, one for each month from January 2020 to December 2020. Each dataset captures the details of every ride logged by Citi Bike customers. This publicly available data has been scrubbed to omit riders’ personal information.

Challenges with the data

The combined size of the 12 datasets is 3.53 GB, which is too large for a spreadsheet. The choice was between R and SQL, and I chose R simply because I could do the data wrangling, analysis, and visualization on the same platform. Even in R, however, this dataset presents a challenge. I first tried RStudio Cloud, combining the months by quarter, but it could not handle the computation because its memory is limited to 1 GB. I had to install RStudio on my laptop, which has 16 GB of memory; while working on the project, RStudio's memory usage sometimes approached 8 GB. Many routine calculations that run fine on small datasets throw errors on this one. A dataset of this size is better suited to a big data platform such as BigQuery, Dataproc, Databricks, or SAS Viya; in short, this case study would be easier to do on a big data platform.

We started with 19,506,857 rows and 15 columns (3.53 GB) and ended up with 19,484,799 rows and 17 columns (3.1 GB).

Installing and loading necessary packages

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.3     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(dplyr)
library(readr)
library(ggplot2)
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.

Importing the datasets by month

January2020 <- read_csv("202001-citibike-tripdata.csv")
## Rows: 1240596 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (3): start station name, end station name, usertype
## dbl  (10): tripduration, start station id, start station latitude, start sta...
## dttm  (2): starttime, stoptime
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
February2020 <- read_csv("202002-citibike-tripdata.csv")
## Rows: 1146830 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (3): start station name, end station name, usertype
## dbl  (10): tripduration, start station id, start station latitude, start sta...
## dttm  (2): starttime, stoptime
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
March2020 <- read_csv("202003-citibike-tripdata.csv")
## Rows: 1068457 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (3): start station name, end station name, usertype
## dbl  (10): tripduration, start station id, start station latitude, start sta...
## dttm  (2): starttime, stoptime
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
April2020 <- read_csv("202004-citibike-tripdata.csv")
## Rows: 682762 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (3): start station name, end station name, usertype
## dbl  (10): tripduration, start station id, start station latitude, start sta...
## dttm  (2): starttime, stoptime
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
May2020 <- read_csv("202005-citibike-tripdata.csv")
## Rows: 1487890 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (3): start station name, end station name, usertype
## dbl  (10): tripduration, start station id, start station latitude, start sta...
## dttm  (2): starttime, stoptime
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
June2020 <- read_csv("202006-citibike-tripdata.csv")
## Rows: 1882273 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (3): start station name, end station name, usertype
## dbl  (10): tripduration, start station id, start station latitude, start sta...
## dttm  (2): starttime, stoptime
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
July2020 <- read_csv("202007-citibike-tripdata.csv")
## Rows: 2105808 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (3): start station name, end station name, usertype
## dbl  (10): tripduration, start station id, start station latitude, start sta...
## dttm  (2): starttime, stoptime
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
August2020 <- read_csv("202008-citibike-tripdata.csv")
## Rows: 2329514 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (3): start station name, end station name, usertype
## dbl  (10): tripduration, start station id, start station latitude, start sta...
## dttm  (2): starttime, stoptime
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
September2020 <- read_csv("202009-citibike-tripdata.csv")
## Rows: 2488225 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (3): start station name, end station name, usertype
## dbl  (10): tripduration, start station id, start station latitude, start sta...
## dttm  (2): starttime, stoptime
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
October2020 <- read_csv("202010-citibike-tripdata.csv")
## Rows: 2248869 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (3): start station name, end station name, usertype
## dbl  (10): tripduration, start station id, start station latitude, start sta...
## dttm  (2): starttime, stoptime
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
November2020 <- read_csv("202011-citibike-tripdata.csv")
## Rows: 1736704 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (3): start station name, end station name, usertype
## dbl  (10): tripduration, start station id, start station latitude, start sta...
## dttm  (2): starttime, stoptime
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
December2020 <- read_csv("202012-citibike-tripdata.csv")
## Rows: 1088929 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (3): start station name, end station name, usertype
## dbl  (10): tripduration, start station id, start station latitude, start sta...
## dttm  (2): starttime, stoptime
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
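As an aside, the twelve read_csv() calls above could be collapsed into one. A sketch, assuming the monthly CSVs sit in the working directory (readr 2.0 and later accepts a vector of paths and row-binds the results):

```r
library(readr)

# Sketch: collect the 12 monthly file names and read them in one call.
# Assumes the CSVs are in the working directory.
files <- list.files(pattern = "^2020[0-9]{2}-citibike-tripdata\\.csv$")
trips_2020 <- read_csv(files, show_col_types = FALSE)
```

Here I kept the month-by-month imports because they make each file's row count visible.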

All 12 datasets have identical column names, which are

colnames(January2020)
##  [1] "tripduration"            "starttime"              
##  [3] "stoptime"                "start station id"       
##  [5] "start station name"      "start station latitude" 
##  [7] "start station longitude" "end station id"         
##  [9] "end station name"        "end station latitude"   
## [11] "end station longitude"   "bikeid"                 
## [13] "usertype"                "birth year"             
## [15] "gender"

Combining the months into quarters because the files are large. The total size of all 12 months is 3.53 GB, which is why we do most of the data manipulation by quarter.

q1_2020 <- bind_rows(January2020, February2020, March2020)
q2_2020 <- bind_rows(April2020, May2020, June2020)
q3_2020 <- bind_rows(July2020, August2020, September2020)
q4_2020 <- bind_rows(October2020, November2020, December2020)

Renaming and recoding variables

As you can see, many column names contain spaces, which are not syntactic variable names in R, so we need to rename those columns. We will also recode the variable usertype, changing Subscriber to member and Customer to casual.

q1_2020 <- rename(q1_2020,
                       startstation_id = "start station id",
                       startstation_name = "start station name",
                       start_lat = "start station latitude",
                       start_lng = "start station longitude",
                       endstation_id = "end station id",
                       endstation_name = "end station name",
                       end_lat = "end station latitude",
                       end_lng = "end station longitude",
                       birthyear = "birth year")
q1_2020 <-  q1_2020 %>% 
  mutate(usertype = recode(usertype, "Subscriber" = "member"
                                ,"Customer" = "casual"))

                  
q2_2020 <- rename(q2_2020,
                      startstation_id = "start station id",
                      startstation_name = "start station name",
                      start_lat = "start station latitude",
                      start_lng = "start station longitude",
                      endstation_id = "end station id",
                      endstation_name = "end station name",
                      end_lat = "end station latitude",
                      end_lng = "end station longitude",
                      birthyear = "birth year")
q2_2020 <-  q2_2020 %>% 
  mutate(usertype = recode(usertype, "Subscriber" = "member"
                           ,"Customer" = "casual"))

q3_2020 <- rename(q3_2020,
                      startstation_id = "start station id",
                      startstation_name = "start station name",
                      start_lat = "start station latitude",
                      start_lng = "start station longitude",
                      endstation_id = "end station id",
                      endstation_name = "end station name",
                      end_lat = "end station latitude",
                      end_lng = "end station longitude",
                      birthyear = "birth year")
q3_2020 <-  q3_2020 %>% 
  mutate(usertype = recode(usertype, "Subscriber" = "member"
                           ,"Customer" = "casual"))

q4_2020 <- rename(q4_2020,
                      startstation_id = "start station id",
                      startstation_name = "start station name",
                      start_lat = "start station latitude",
                      start_lng = "start station longitude",
                      endstation_id = "end station id",
                      endstation_name = "end station name",
                      end_lat = "end station latitude",
                      end_lng = "end station longitude",
                      birthyear = "birth year")
q4_2020 <-  q4_2020 %>% 
  mutate(usertype = recode(usertype, "Subscriber" = "member"
                           ,"Customer" = "casual"))

The variable gender is numeric, where 0 is unknown, 1 is male, and 2 is female. We will recode gender to character labels to make it easier to understand.

q1_2020 <-  q1_2020 %>% 
  mutate(gender = recode(gender, "0" = "unknown","1" = "male", 
                         "2" = "female" ))

q2_2020 <-  q2_2020 %>% 
  mutate(gender = recode(gender, "0" = "unknown","1" = "male", 
                         "2" = "female" ))

q3_2020 <-  q3_2020 %>% 
  mutate(gender = recode(gender, "0" = "unknown","1" = "male", 
                         "2" = "female" ))
q4_2020 <-  q4_2020 %>% 
  mutate(gender = recode(gender, "0" = "unknown","1" = "male", 
                         "2" = "female" ))
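The four nearly identical rename/recode blocks above could also be collapsed into one helper function applied to each quarter; a sketch, assuming the quarterly data frames exist as created above:

```r
library(dplyr)

# Sketch: one helper that applies the renaming and recoding to any quarter,
# so the same block of code is not repeated four times.
clean_quarter <- function(df) {
  df %>%
    rename(startstation_id   = "start station id",
           startstation_name = "start station name",
           start_lat = "start station latitude",
           start_lng = "start station longitude",
           endstation_id   = "end station id",
           endstation_name = "end station name",
           end_lat = "end station latitude",
           end_lng = "end station longitude",
           birthyear = "birth year") %>%
    mutate(usertype = recode(usertype, "Subscriber" = "member",
                                       "Customer"   = "casual"),
           gender   = recode(gender, "0" = "unknown",
                                     "1" = "male", "2" = "female"))
}

# q1_2020 <- clean_quarter(q1_2020)   # likewise for q2_2020 .. q4_2020
```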

Removing missing data and finding outliers

First we are going to remove missing values, then use descriptive statistics to find outliers in the numerical variables

q1_2020 <- q1_2020[complete.cases(q1_2020), ]
q1_2020
## # A tibble: 3,455,883 x 15
##    tripduration starttime           stoptime            startstation_id
##           <dbl> <dttm>              <dttm>                        <dbl>
##  1          789 2020-01-01 00:00:55 2020-01-01 00:14:05             504
##  2         1541 2020-01-01 00:01:08 2020-01-01 00:26:49            3423
##  3         1464 2020-01-01 00:01:42 2020-01-01 00:26:07            3687
##  4          592 2020-01-01 00:01:45 2020-01-01 00:11:38             346
##  5          702 2020-01-01 00:01:45 2020-01-01 00:13:28             372
##  6         1740 2020-01-01 00:02:04 2020-01-01 00:31:04             394
##  7          203 2020-01-01 00:02:06 2020-01-01 00:05:29            3093
##  8         1758 2020-01-01 00:02:19 2020-01-01 00:31:37             514
##  9          114 2020-01-01 00:03:49 2020-01-01 00:05:44             274
## 10          417 2020-01-01 00:03:53 2020-01-01 00:10:50             467
## # ... with 3,455,873 more rows, and 11 more variables: startstation_name <chr>,
## #   start_lat <dbl>, start_lng <dbl>, endstation_id <dbl>,
## #   endstation_name <chr>, end_lat <dbl>, end_lng <dbl>, bikeid <dbl>,
## #   usertype <chr>, birthyear <dbl>, gender <chr>
summary(q1_2020)
##   tripduration       starttime                      stoptime                  
##  Min.   :     61   Min.   :2020-01-01 00:00:55   Min.   :2020-01-01 00:05:29  
##  1st Qu.:    335   1st Qu.:2020-01-23 09:22:21   1st Qu.:2020-01-23 09:34:19  
##  Median :    563   Median :2020-02-13 17:30:43   Median :2020-02-13 17:42:39  
##  Mean   :    903   Mean   :2020-02-13 04:54:11   Mean   :2020-02-13 05:09:15  
##  3rd Qu.:    990   3rd Qu.:2020-03-05 08:06:52   3rd Qu.:2020-03-05 08:19:01  
##  Max.   :3318587   Max.   :2020-03-31 23:59:17   Max.   :2020-04-17 11:31:13  
##  startstation_id startstation_name    start_lat       start_lng     
##  Min.   :  72    Length:3455883     Min.   :40.66   Min.   :-74.02  
##  1st Qu.: 389    Class :character   1st Qu.:40.72   1st Qu.:-73.99  
##  Median : 518    Mode  :character   Median :40.74   Median :-73.98  
##  Mean   :1724                       Mean   :40.74   Mean   :-73.98  
##  3rd Qu.:3335                       3rd Qu.:40.76   3rd Qu.:-73.97  
##  Max.   :3919                       Max.   :40.82   Max.   :-73.90  
##  endstation_id  endstation_name       end_lat         end_lng      
##  Min.   :  72   Length:3455883     Min.   :40.66   Min.   :-74.07  
##  1st Qu.: 387   Class :character   1st Qu.:40.72   1st Qu.:-73.99  
##  Median : 517   Mode  :character   Median :40.74   Median :-73.98  
##  Mean   :1716                      Mean   :40.74   Mean   :-73.98  
##  3rd Qu.:3335                      3rd Qu.:40.76   3rd Qu.:-73.97  
##  Max.   :3919                      Max.   :40.82   Max.   :-73.90  
##      bikeid        usertype           birthyear       gender         
##  Min.   :14530   Length:3455883     Min.   :1885   Length:3455883    
##  1st Qu.:28657   Class :character   1st Qu.:1969   Class :character  
##  Median :34195   Mode  :character   Median :1983   Mode  :character  
##  Mean   :32906                      Mean   :1980                     
##  3rd Qu.:39831                      3rd Qu.:1990                     
##  Max.   :43878                      Max.   :2004
q2_2020 <- q2_2020[complete.cases(q2_2020), ]
q2_2020
## # A tibble: 4,052,925 x 15
##    tripduration starttime           stoptime            startstation_id
##           <dbl> <dttm>              <dttm>                        <dbl>
##  1          534 2020-04-01 00:00:15 2020-04-01 00:09:09            3656
##  2          529 2020-04-01 00:02:28 2020-04-01 00:11:18            3163
##  3         1488 2020-04-01 00:02:56 2020-04-01 00:27:44            3164
##  4          341 2020-04-01 00:03:25 2020-04-01 00:09:06            3827
##  5         2206 2020-04-01 00:05:35 2020-04-01 00:42:21            3890
##  6          921 2020-04-01 00:05:50 2020-04-01 00:21:11            3697
##  7          819 2020-04-01 00:05:57 2020-04-01 00:19:37             297
##  8          780 2020-04-01 00:06:41 2020-04-01 00:19:41            3101
##  9          865 2020-04-01 00:10:05 2020-04-01 00:24:30            3244
## 10          365 2020-04-01 00:10:27 2020-04-01 00:16:33            3041
## # ... with 4,052,915 more rows, and 11 more variables: startstation_name <chr>,
## #   start_lat <dbl>, start_lng <dbl>, endstation_id <dbl>,
## #   endstation_name <chr>, end_lat <dbl>, end_lng <dbl>, bikeid <dbl>,
## #   usertype <chr>, birthyear <dbl>, gender <chr>
summary(q2_2020)
##   tripduration       starttime                      stoptime                  
##  Min.   :     61   Min.   :2020-04-01 00:00:15   Min.   :2020-04-01 00:09:06  
##  1st Qu.:    536   1st Qu.:2020-05-09 18:37:09   1st Qu.:2020-05-09 18:59:57  
##  Median :   1043   Median :2020-05-30 13:20:50   Median :2020-05-30 13:47:33  
##  Mean   :   1632   Mean   :2020-05-26 18:06:10   Mean   :2020-05-26 18:33:23  
##  3rd Qu.:   1736   3rd Qu.:2020-06-16 06:58:28   3rd Qu.:2020-06-16 07:22:14  
##  Max.   :3786187   Max.   :2020-06-30 23:59:53   Max.   :2020-07-05 19:32:25  
##  startstation_id startstation_name    start_lat       start_lng     
##  Min.   :  72    Length:4052925     Min.   :40.66   Min.   :-74.02  
##  1st Qu.: 427    Class :character   1st Qu.:40.71   1st Qu.:-73.99  
##  Median :3101    Mode  :character   Median :40.74   Median :-73.98  
##  Mean   :2068                       Mean   :40.74   Mean   :-73.98  
##  3rd Qu.:3440                       3rd Qu.:40.77   3rd Qu.:-73.96  
##  Max.   :4038                       Max.   :40.84   Max.   :-73.90  
##  endstation_id  endstation_name       end_lat         end_lng      
##  Min.   :  72   Length:4052925     Min.   :40.66   Min.   :-74.07  
##  1st Qu.: 426   Class :character   1st Qu.:40.71   1st Qu.:-73.99  
##  Median :3096   Mode  :character   Median :40.74   Median :-73.98  
##  Mean   :2061                      Mean   :40.74   Mean   :-73.98  
##  3rd Qu.:3440                      3rd Qu.:40.77   3rd Qu.:-73.96  
##  Max.   :4038                      Max.   :40.84   Max.   :-73.90  
##      bikeid        usertype           birthyear       gender         
##  Min.   :14529   Length:4052925     Min.   :1873   Length:4052925    
##  1st Qu.:29071   Class :character   1st Qu.:1969   Class :character  
##  Median :36833   Mode  :character   Median :1984   Mode  :character  
##  Mean   :34062                      Mean   :1981                     
##  3rd Qu.:40942                      3rd Qu.:1991                     
##  Max.   :44542                      Max.   :2004
q3_2020 <- q3_2020[complete.cases(q3_2020), ]
q3_2020
## # A tibble: 6,923,547 x 15
##    tripduration starttime           stoptime            startstation_id
##           <dbl> <dttm>              <dttm>                        <dbl>
##  1          341 2020-07-01 00:00:01 2020-07-01 00:05:42            3463
##  2          321 2020-07-01 00:00:01 2020-07-01 00:05:22            3463
##  3         2710 2020-07-01 00:00:06 2020-07-01 00:45:17             426
##  4         2685 2020-07-01 00:00:13 2020-07-01 00:44:58             426
##  5          191 2020-07-01 00:00:13 2020-07-01 00:03:24            3615
##  6          168 2020-07-01 00:00:17 2020-07-01 00:03:06            3656
##  7          514 2020-07-01 00:00:21 2020-07-01 00:08:55            3318
##  8          712 2020-07-01 00:00:31 2020-07-01 00:12:23            3523
##  9         1307 2020-07-01 00:00:35 2020-07-01 00:22:23             477
## 10         2281 2020-07-01 00:00:37 2020-07-01 00:38:38             474
## # ... with 6,923,537 more rows, and 11 more variables: startstation_name <chr>,
## #   start_lat <dbl>, start_lng <dbl>, endstation_id <dbl>,
## #   endstation_name <chr>, end_lat <dbl>, end_lng <dbl>, bikeid <dbl>,
## #   usertype <chr>, birthyear <dbl>, gender <chr>
summary(q3_2020)
##   tripduration       starttime                      stoptime                  
##  Min.   :     61   Min.   :2020-07-01 00:00:01   Min.   :2020-07-01 00:03:06  
##  1st Qu.:    468   1st Qu.:2020-07-26 17:58:56   1st Qu.:2020-07-26 18:23:38  
##  Median :    854   Median :2020-08-19 18:50:06   Median :2020-08-19 19:11:54  
##  Mean   :   1445   Mean   :2020-08-18 10:19:46   Mean   :2020-08-18 10:43:51  
##  3rd Qu.:   1485   3rd Qu.:2020-09-10 17:09:26   3rd Qu.:2020-09-10 17:27:33  
##  Max.   :3522382   Max.   :2020-09-30 23:59:57   Max.   :2020-10-13 10:42:27  
##  startstation_id startstation_name    start_lat       start_lng     
##  Min.   :  72    Length:6923547     Min.   :40.66   Min.   :-74.03  
##  1st Qu.: 426    Class :character   1st Qu.:40.72   1st Qu.:-73.99  
##  Median :3100    Mode  :character   Median :40.74   Median :-73.98  
##  Mean   :2082                       Mean   :40.74   Mean   :-73.98  
##  3rd Qu.:3507                       3rd Qu.:40.76   3rd Qu.:-73.96  
##  Max.   :4202                       Max.   :40.85   Max.   :-73.89  
##  endstation_id  endstation_name       end_lat         end_lng      
##  Min.   :  72   Length:6923547     Min.   :40.66   Min.   :-74.07  
##  1st Qu.: 426   Class :character   1st Qu.:40.72   1st Qu.:-73.99  
##  Median :3095   Mode  :character   Median :40.74   Median :-73.98  
##  Mean   :2075                      Mean   :40.74   Mean   :-73.98  
##  3rd Qu.:3503                      3rd Qu.:40.76   3rd Qu.:-73.96  
##  Max.   :4202                      Max.   :40.85   Max.   :-73.89  
##      bikeid        usertype           birthyear       gender         
##  Min.   :14529   Length:6923547     Min.   :1873   Length:6923547    
##  1st Qu.:29090   Class :character   1st Qu.:1969   Class :character  
##  Median :37804   Mode  :character   Median :1985   Mode  :character  
##  Mean   :35250                      Mean   :1982                     
##  3rd Qu.:42802                      3rd Qu.:1992                     
##  Max.   :47855                      Max.   :2004
q4_2020 <- q4_2020[complete.cases(q4_2020), ]
q4_2020
## # A tibble: 5,074,502 x 15
##    tripduration starttime           stoptime            startstation_id
##           <dbl> <dttm>              <dttm>                        <dbl>
##  1         1403 2020-10-01 00:00:10 2020-10-01 00:23:34             224
##  2          462 2020-10-01 00:00:18 2020-10-01 00:08:00            3161
##  3          770 2020-10-01 00:00:21 2020-10-01 00:13:12             354
##  4          233 2020-10-01 00:00:25 2020-10-01 00:04:18            3141
##  5          126 2020-10-01 00:00:25 2020-10-01 00:02:32             335
##  6         1632 2020-10-01 00:00:32 2020-10-01 00:27:45             502
##  7          234 2020-10-01 00:00:32 2020-10-01 00:04:27            2006
##  8          523 2020-10-01 00:00:41 2020-10-01 00:09:24            3435
##  9          737 2020-10-01 00:00:43 2020-10-01 00:13:00             480
## 10         3510 2020-10-01 00:00:43 2020-10-01 00:59:13            4043
## # ... with 5,074,492 more rows, and 11 more variables: startstation_name <chr>,
## #   start_lat <dbl>, start_lng <dbl>, endstation_id <dbl>,
## #   endstation_name <chr>, end_lat <dbl>, end_lng <dbl>, bikeid <dbl>,
## #   usertype <chr>, birthyear <dbl>, gender <chr>
summary(q4_2020)
##   tripduration       starttime                      stoptime                  
##  Min.   :     61   Min.   :2020-10-01 00:00:10   Min.   :2020-10-01 00:02:32  
##  1st Qu.:    395   1st Qu.:2020-10-17 16:44:57   1st Qu.:2020-10-17 17:08:32  
##  Median :    705   Median :2020-11-06 06:42:29   Median :2020-11-06 06:56:34  
##  Mean   :   1149   Mean   :2020-11-07 21:30:24   Mean   :2020-11-07 21:49:33  
##  3rd Qu.:   1246   3rd Qu.:2020-11-27 07:15:38   3rd Qu.:2020-11-27 07:34:28  
##  Max.   :2687593   Max.   :2020-12-31 23:59:57   Max.   :2021-01-04 18:16:54  
##  startstation_id startstation_name    start_lat       start_lng     
##  Min.   :  72    Length:5074502     Min.   :40.66   Min.   :-74.03  
##  1st Qu.: 436    Class :character   1st Qu.:40.72   1st Qu.:-73.99  
##  Median :3119    Mode  :character   Median :40.74   Median :-73.98  
##  Mean   :2123                       Mean   :40.74   Mean   :-73.98  
##  3rd Qu.:3526                       3rd Qu.:40.77   3rd Qu.:-73.96  
##  Max.   :4332                       Max.   :40.85   Max.   :-73.88  
##  endstation_id  endstation_name       end_lat         end_lng      
##  Min.   :  72   Length:5074502     Min.   :40.66   Min.   :-74.07  
##  1st Qu.: 435   Class :character   1st Qu.:40.72   1st Qu.:-73.99  
##  Median :3117   Mode  :character   Median :40.74   Median :-73.98  
##  Mean   :2119                      Mean   :40.74   Mean   :-73.98  
##  3rd Qu.:3526                      3rd Qu.:40.77   3rd Qu.:-73.96  
##  Max.   :4332                      Max.   :40.85   Max.   :-73.88  
##      bikeid        usertype           birthyear       gender         
##  Min.   :14529   Length:5074502     Min.   :1873   Length:5074502    
##  1st Qu.:34700   Class :character   1st Qu.:1969   Class :character  
##  Median :40813   Mode  :character   Median :1984   Mode  :character  
##  Mean   :38939                      Mean   :1981                     
##  3rd Qu.:46450                      3rd Qu.:1991                     
##  Max.   :50107                      Max.   :2004

Looking at the summary statistics, we notice issues with the variables tripduration and birthyear. Citi Bike doesn’t allow a bike to be kept out for more than 24 hours, which is 86,400 seconds, yet we see values greater than that. For the variable birthyear we see values from the 1800s. For our purposes we will exclude trips with tripduration greater than 86,400 seconds and riders with birthyear earlier than 1931.

q1_2020 <- q1_2020[q1_2020$birthyear > 1930 &
                     q1_2020$tripduration <= 86400,]
q2_2020 <- q2_2020[q2_2020$birthyear > 1930 & 
                     q2_2020$tripduration <= 86400,]
q3_2020 <- q3_2020[q3_2020$birthyear > 1930 & 
                     q3_2020$tripduration <= 86400,]
q4_2020 <- q4_2020[q4_2020$birthyear > 1930 & 
                     q4_2020$tripduration <= 86400,]
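The same thresholds can be applied with dplyr::filter(). A sketch; one practical difference worth knowing is that, unlike base `[` subsetting, filter() drops rows where the condition evaluates to NA rather than returning rows of NAs:

```r
library(dplyr)

# Equivalent filtering with dplyr. filter() silently drops rows where the
# condition is NA (e.g. a missing birthyear), which base `[` does not.
q1_2020 <- q1_2020 %>%
  filter(birthyear > 1930, tripduration <= 86400)
# ... and likewise for q2_2020, q3_2020 and q4_2020
```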

Extracting variables from datetime

We are going to extract the variables Month (Jan-Dec), weekday (Monday-Sunday), week (1-52), Day (1-31), and Hour (0-23) from starttime.

q1_2020$Month <- month.abb[month(as.Date(q1_2020$starttime))]
q1_2020$weekday <- weekdays(as.Date(q1_2020$starttime))
q1_2020$week <- as.numeric(format(q1_2020$starttime,"%W"))
q1_2020$Day <- day(as.Date(q1_2020$starttime))
q1_2020$Hour <- hour(q1_2020$starttime)

q2_2020$Month <- month.abb[month(as.Date(q2_2020$starttime))]
q2_2020$weekday <- weekdays(as.Date(q2_2020$starttime))
q2_2020$week <- as.numeric(format(q2_2020$starttime,"%W"))
q2_2020$Day <- day(as.Date(q2_2020$starttime))
q2_2020$Hour <- hour(q2_2020$starttime)


q3_2020$Month <- month.abb[month(as.Date(q3_2020$starttime))]
q3_2020$weekday <- weekdays(as.Date(q3_2020$starttime))
q3_2020$week <- as.numeric(format(q3_2020$starttime,"%W"))
q3_2020$Day <- day(as.Date(q3_2020$starttime))
q3_2020$Hour <- hour(q3_2020$starttime)

q4_2020$Month <- month.abb[month(as.Date(q4_2020$starttime))]
q4_2020$weekday <- weekdays(as.Date(q4_2020$starttime))
q4_2020$week <- as.numeric(format(q4_2020$starttime,"%W"))
q4_2020$Day <- day(as.Date(q4_2020$starttime))
q4_2020$Hour <- hour(q4_2020$starttime)
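As with the renaming step, these five per-quarter assignments could be wrapped in one function; a sketch, assuming the quarterly data frames from above:

```r
library(dplyr)
library(lubridate)

# Sketch: derive all five calendar fields from starttime in one function
# instead of repeating the same five assignments for each quarter.
add_time_features <- function(df) {
  df %>%
    mutate(Month   = month.abb[month(starttime)],
           weekday = weekdays(as.Date(starttime)),
           week    = as.numeric(format(starttime, "%W")),
           Day     = day(starttime),
           Hour    = hour(starttime))
}

# q1_2020 <- add_time_features(q1_2020)   # likewise for q2-q4
```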

Creating new variables age and agegroup

q1_2020$age <- 2020 - q1_2020$birthyear
q1_2020["age_grp"] = cut(q1_2020$age, c(15, 30, 45, 60, 90),
                          c("Under30", "30-44", "45-59", "60plus"),
                          include.lowest=TRUE)

q2_2020$age <- 2020 - q2_2020$birthyear
q2_2020["age_grp"] = cut(q2_2020$age, c(15, 30, 45, 60, 90),
                         c("Under30", "30-44", "45-59", "60plus"),
                         include.lowest=TRUE)

q3_2020$age <- 2020 - q3_2020$birthyear
q3_2020["age_grp"] = cut(q3_2020$age, c(15, 30, 45, 60, 90),
                         c("Under30", "30-44", "45-59", "60plus"),
                         include.lowest=TRUE)

q4_2020$age <- 2020 - q4_2020$birthyear
q4_2020["age_grp"] = cut(q4_2020$age, c(15, 30, 45, 60, 90),
                         c("Under30", "30-44", "45-59", "60plus"),
                         include.lowest=TRUE)
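One subtlety worth noting: cut() uses right-closed intervals by default, so with these breaks the bins are [15,30], (30,45], (45,60], (60,90] and boundary ages fall in the lower bin, e.g. a 30-year-old lands in "Under30" and a 45-year-old in "30-44":

```r
# Boundary ages fall in the lower (right-closed) bin.
cut(c(30, 31, 45, 46), c(15, 30, 45, 60, 90),
    c("Under30", "30-44", "45-59", "60plus"), include.lowest = TRUE)
# -> Under30  30-44  30-44  45-59
```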

Dropping unnecessary columns for the analysis

q1_2020 <- select(q1_2020, -c(stoptime, starttime, birthyear, bikeid, age))
q2_2020 <- select(q2_2020, -c(stoptime, starttime, birthyear, bikeid, age))
q3_2020 <- select(q3_2020, -c(stoptime, starttime, birthyear, bikeid, age))
q4_2020 <- select(q4_2020, -c(stoptime, starttime, birthyear, bikeid, age))

Combining the 4 quarterly datasets into one dataset

The combined dataset has 19,484,799 rows and 17 columns and is about 3.1 GB.

nyc_bike_trips <- bind_rows(q1_2020, q2_2020, q3_2020, q4_2020)

ANALYSIS

First, let’s find out who the customers are

round(prop.table(table(nyc_bike_trips$usertype, 
                       nyc_bike_trips$gender))*100, 2)
##         
##          female  male unknown
##   casual   5.54  8.16    9.62
##   member  22.93 52.33    1.42
round(prop.table(table(nyc_bike_trips$age_grp,
                       nyc_bike_trips$usertype))*100, 2)
##          
##           casual member
##   Under30   8.66  23.46
##   30-44     4.34  31.42
##   45-59    10.21  17.09
##   60plus    0.12   4.71
round(prop.table(table(nyc_bike_trips$age_grp, 
                       nyc_bike_trips$gender))*100, 2)
##          
##           female  male unknown
##   Under30  11.50 20.25    0.37
##   30-44    10.67 24.67    0.41
##   45-59     5.00 12.07   10.23
##   60plus    1.30  3.51    0.02

From the tables above we see that about 60% of all trips are made by males; about 99% of the trips made by casual users come from riders under 60 years old; about 93% of the trips made by customers of unknown gender fall in the 45-59 age group; and around 77% of all trips are made by member users.
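The overall shares quoted here (roughly 60% male, 77% member) are the margins of the tables above; they can also be computed directly:

```r
# Marginal shares: percentage of all trips by gender and by user type
round(prop.table(table(nyc_bike_trips$gender)) * 100, 2)
round(prop.table(table(nyc_bike_trips$usertype)) * 100, 2)
```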

# Let's find total trip time by user type
ttime <- nyc_bike_trips %>% group_by(usertype) %>% 
  summarise(total_time = sum(tripduration))
ttime
## # A tibble: 2 x 2
##   usertype  total_time
##   <chr>          <dbl>
## 1 casual    7651549394
## 2 member   13917547774

We see above that casual users contribute 35.5% of total trip duration, even though they account for only about 23% of total trips.
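That 35.5% share can be computed directly from `ttime` (a sketch; `share` is a column name introduced here):

```r
# Percentage of total trip duration contributed by each user type
ttime %>% mutate(share = round(100 * total_time / sum(total_time), 1))
# casual: 35.5%, member: 64.5%
```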

Visualization

Before we start visualizing the data, we need to fix the order of the weekday and month factors.

nyc_bike_trips$weekday <- ordered(nyc_bike_trips$weekday, 
                              levels=c( "Monday", "Tuesday", "Wednesday",
                            "Thursday", "Friday", "Saturday", "Sunday"))

nyc_bike_trips$Month <- ordered(nyc_bike_trips$Month, 
                        levels=c( "Jan", "Feb", "Mar","Apr", "May", "Jun",
                               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))

Line plots

For the next four plots I use line charts rather than bar charts because we want to see the trend for each user type. The first two plots focus on the number of rides and the average ride duration by day of the week.

nyc_bike_trips %>% 
  group_by(usertype, weekday) %>% 
  summarise(rides = n()) %>% 
  arrange(usertype, weekday)  %>% 
  ggplot(aes(x = weekday, y = rides, 
  group = usertype, color = usertype)) + geom_line() + geom_point() +
  labs(title = "Total trips by usertype Vs. Days of the week") +
  xlab("Days of the week") + ylab("Number of rides") +
  scale_color_manual(values=c('blue','red'))
## `summarise()` has grouped output by 'usertype'. You can override using the `.groups` argument.

We see in the figure above that the number of trips made by member users is higher than the number made by casual users on every day of the week. For casual users the number of rides is higher on Saturday and Sunday, while for member users it is lower on those days.

nyc_bike_trips %>% 
  group_by(usertype, weekday) %>% 
  summarise(average_duration = mean(tripduration)) %>% 
  arrange(usertype, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, 
  group = usertype, color = usertype)) + geom_line() + geom_point()+
  labs(title = "Days of the week Vs. Average trip duration") +
  xlab("Days of the week") + ylab("Average ride duration in sec.") +
  scale_color_manual(values=c('blue','red'))
## `summarise()` has grouped output by 'usertype'. You can override using the `.groups` argument.

Above, we see that the average trip duration is higher for casual users than for member users on every day of the week, and the two lines follow a very similar shape.

nyc_bike_trips %>% 
  group_by(usertype, Hour) %>% 
  summarise(rides = n()) %>%
  arrange(usertype, Hour)  %>% 
  ggplot(aes(x = Hour, y = rides, 
  group = usertype, color = usertype)) + geom_line()+ geom_point() +
  labs(title = "Total trips by usertype through the day") +
  xlab("Time of the day") + ylab("Number of rides") +
  scale_color_manual(values=c('blue','red'))
## `summarise()` has grouped output by 'usertype'. You can override using the `.groups` argument.

In the plot above we see that from midnight to 5 am the total number of rides is very close between the two groups, but after 5 am the gap widens rapidly. For both groups the total number of trips peaks at 5 pm.

nyc_bike_trips %>% 
  group_by(usertype, Hour) %>% 
  summarise(average_duration = mean(tripduration)) %>% 
  arrange(usertype, Hour)  %>% 
  ggplot(aes(x = Hour, y = average_duration, 
  group = usertype, color = usertype)) + geom_line() + geom_point() +
  labs(title = "Average trip duration by usertype through the day") +
  xlab("Time of the day") + ylab("Average trip duration in sec.") +
  scale_color_manual(values=c('blue','red'))
## `summarise()` has grouped output by 'usertype'. You can override using the `.groups` argument.

We see above that the average trip duration is higher for casual users than for member users at every hour of the day.

Heat maps

Now let’s look at the whole year; to do that we will use calendar heat maps.

nyc_bike_trips %>% 
  group_by(weekday, week, usertype) %>% 
  summarise(rides = n()) %>%
  ggplot(aes(x = week, y = weekday, fill = rides)) +
  geom_tile(color = 'white', size = 0.1) +
  coord_fixed(ratio = 2) + 
  scale_fill_gradient(name = "NYC Citi Bike", low = "green", high = "red") + 
  scale_x_continuous(expand = c(0, 0), breaks = seq(1, 52, length = 12),
    labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) +
  facet_grid(usertype ~ .) +
  labs(title = "Calendar heatmap of total trips")
## `summarise()` has grouped output by 'weekday', 'week'. You can override using the `.groups` argument.

We see above that the total number of rides is higher for member users than for casual users on every day of the week and every week of the year. For casual users the total number of trips is higher on Saturday and Sunday from March to November. For member users, on the other hand, the total number of trips is higher on weekdays, except in March and April.

nyc_bike_trips %>% 
  group_by(weekday, week, usertype) %>% 
  summarise(av_duration = mean(tripduration)) %>%
  ggplot(aes(x = week, y = weekday, fill = av_duration)) +
  geom_tile(color = 'white', size = 0.1) +
  coord_fixed(ratio = 2) + 
  scale_fill_gradient(name = "NYC Citi Bike", low = "green", high = "red") +
  scale_x_continuous(expand = c(0, 0), breaks = seq(1, 52, length = 12),
    labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) +
  facet_grid(usertype ~ .) +
  labs(title = "Calendar heatmap of average trip duration in seconds")
## `summarise()` has grouped output by 'weekday', 'week'. You can override using the `.groups` argument.

We can see from the heat map above that the average trip duration is higher for casual users than for member users on every day of the week and every week of the year. The average is a little higher for both groups on Saturday and Sunday, except from March to June.

nyc_bike_trips %>% 
  group_by(weekday, week, usertype) %>% 
  summarise(med_duration = median(tripduration)) %>%
  ggplot(aes(x = week, y = weekday, fill = med_duration)) +
  geom_tile(color = 'white', size = 0.1) +
  coord_fixed(ratio = 2) + 
  scale_fill_gradient(name = "NYC Citi Bike", low = "green", high = "red") +
  scale_x_continuous(expand = c(0, 0), breaks = seq(1, 52, length = 12),
    labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) +
  facet_grid(usertype ~ .) +
  labs(title = "Calendar heatmap of median trip duration in seconds")
## `summarise()` has grouped output by 'weekday', 'week'. You can override using the `.groups` argument.

We can see from the heat map above that the median trip duration is higher for casual users than for member users on every day of the week and every week of the year. For both groups it is higher on Saturday and Sunday, except from March to June for casual users.

MAPS

To really understand the data, we need to see where the activity is happening.

First we need to create a Stamen map of New York City

nyc_bb <- c(left = min(nyc_bike_trips$start_lng), 
            bottom = min(nyc_bike_trips$start_lat), 
            right = max(nyc_bike_trips$start_lng), 
            top = max(nyc_bike_trips$start_lat))

nyc_st <- get_stamenmap(bbox = nyc_bb, zoom = 12, maptype = "terrain")
## Source : http://tile.stamen.com/terrain/12/1205/1537.png
## Source : http://tile.stamen.com/terrain/12/1206/1537.png
## Source : http://tile.stamen.com/terrain/12/1207/1537.png
## Source : http://tile.stamen.com/terrain/12/1205/1538.png
## Source : http://tile.stamen.com/terrain/12/1206/1538.png
## Source : http://tile.stamen.com/terrain/12/1207/1538.png
## Source : http://tile.stamen.com/terrain/12/1205/1539.png
## Source : http://tile.stamen.com/terrain/12/1206/1539.png
## Source : http://tile.stamen.com/terrain/12/1207/1539.png
## Source : http://tile.stamen.com/terrain/12/1205/1540.png
## Source : http://tile.stamen.com/terrain/12/1206/1540.png
## Source : http://tile.stamen.com/terrain/12/1207/1540.png

Let’s find the 100 most popular stations by user type

# Sorting stations by the total number of ride
stations <- nyc_bike_trips %>%
  group_by(start_lng, start_lat, startstation_id, startstation_name,usertype) %>% 
  summarize(nTrips = n()) %>% 
  arrange(desc(nTrips)) 
## `summarise()` has grouped output by 'start_lng', 'start_lat', 'startstation_id', 'startstation_name'. You can override using the `.groups` argument.
# Getting the top 100 popular stations for member users
top_100member <- stations %>% filter(usertype == "member")
top_100member <- top_100member[1:100,]

# Getting the top 100 popular stations for casual users
top100Causual <- stations %>% filter(usertype == "casual")
top100Causual <- top100Causual[1:100,]

There are more than 1,000 bike stations in New York City; we will focus only on the top 100 for each user type.

Now let’s map the top 100 stations for member users

ggmap(nyc_st, darken = c(.3,"#FFFFFF")) +
  geom_point(data = top_100member %>% 
  group_by(longitude = start_lng, latitude = start_lat) %>%
  summarize(rides = sum(nTrips)), aes(x = longitude, y = latitude,
  color = rides), size = 2, alpha = 1) +
  scale_colour_gradient(low = "blue", high = "red") +
  coord_cartesian(xlim = c(-74.021, -73.92), ylim = c(40.65, 40.81))+
  xlab("Longitude") + ylab("Latitude") +
  labs(title = "The top 100 stations for member users")
## `summarise()` has grouped output by 'longitude'. You can override using the `.groups` argument.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

For member users we see that most of the top 100 stations are north of Canal Street and south of 79th Street. Only two stations, both in Brooklyn, lie outside Manhattan. The stations are tightly clustered, especially in the area north of Delancey Street, south of 42nd Street, west of First Avenue, and east of Tenth Avenue.
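For comparison, the corresponding map for casual users can be drawn the same way, reusing the objects built above (a sketch; the axis limits used for the member map may need adjusting for the casual stations):

```r
# Map the top 100 casual-user stations on the same Stamen base map
ggmap(nyc_st, darken = c(.3, "#FFFFFF")) +
  geom_point(data = top100Causual %>% 
               group_by(longitude = start_lng, latitude = start_lat) %>%
               summarize(rides = sum(nTrips)),
             aes(x = longitude, y = latitude, color = rides),
             size = 2, alpha = 1) +
  scale_colour_gradient(low = "blue", high = "red") +
  xlab("Longitude") + ylab("Latitude") +
  labs(title = "The top 100 stations for casual users")
```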

Creating CSV files of the clean data for further analysis or visualization in other tools such as SQL, Tableau, or Power BI.

Datasets by quarter

write.csv(q1_2020, "D:/Documents/Google Analytics/Case Study/q1_2020.csv")

write.csv(q2_2020, "D:/Documents/Google Analytics/Case Study/q2_2020.csv")

write.csv(q3_2020, "D:/Documents/Google Analytics/Case Study/q3_2020.csv")

write.csv(q4_2020, "D:/Documents/Google Analytics/Case Study/q4_2020.csv")

The 3.1 GB dataset

write.csv(nyc_bike_trips, "D:/Documents/Google Analytics/Case Study/citibike_2020.csv")
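For a file this size, base `write.csv` is slow; `data.table::fwrite` (or `readr::write_csv`) writes the same CSV far faster, assuming one of those packages is installed:

```r
# Much faster CSV export for a ~3 GB data frame
data.table::fwrite(nyc_bike_trips,
                   "D:/Documents/Google Analytics/Case Study/citibike_2020.csv")
```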

Key Takeaways

  • About 60% of total trips are made by males, about 28% by females, and about 11% by riders of unknown gender.
  • Around 77% of total trips are made by member users.
  • Casual users contribute 35.5% of total ride time while accounting for only about 23% of total trips.
  • Member users commute mostly in Manhattan south of 79th Street, between First and Tenth Avenues, mostly on weekdays.
  • Casual users mostly ride on weekends and frequent tourist attractions.

Conclusion

  • Given that Citi Bike charges by trip duration, and that casual users’ average trip duration is higher than members’, we cannot conclude from the data we have that converting casual users into members will increase revenue.
  • We need more data on casual users to answer what is, in my opinion, the most important question: are casual users one-time or repeat users? New York City is full of tourists, especially on weekends, but we cannot assume that they are the casual users; we would need information such as a user id.
  • Answering that question matters because we do not want to waste time and energy trying to convert one-time users, or out-of-state visitors, into members.